{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "---\n",
    "layout: tutorial\n",
    "title: Embedding Tokens\n",
    "id: embedding-tokens\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook introduces how AllenNLP handles one of the key aspects of applying deep learning techniques to textual data: learning distributed representations of words and sentences.\n",
    "\n",
    "Recently, there has been an explosion of different techniques to represent words and sentences in NLP, including pre-trained word vectors, character level CNN encodings and sub-word token representation (e.g byte encodings). Even more complex learned representations of higher level lingustic features, such as POS tags, named entities and dependency paths have also proven successful for a wide variety of NLP tasks.\n",
    "\n",
    "In order to deal with this breadth of methods for representing words as vectors, AllenNLP introduces 3 key abstractions:\n",
    "\n",
    "- `TokenIndexers`, which generate indexed tensors representing sentences in different ways. See the [Data Pipeline notebook](data_pipeline.ipynb) for more info. \n",
    "\n",
    "- `TokenEmbedders`, which transform indexed tensors into embedded representations. At its most basic, this is just a standard `Embedding` layer you'd find in any neural network library. However, they can be more complex - for instance, AllenNLP has a `token_characters_encoder` which applies a CNN to character level representations.\n",
    "- `TextFieldEmbedders`, which are a wrapper around a set of `TokenEmbedders`. At it's most basic, this applies the `TokenEmbedders` which it is passed and concatenates their output.\n",
    "\n",
    "Using this hierarchy allows you to easily compose different representations of a sentence together in modular ways. For instance, in the Bidaf model, we use this to concatenate a character level CNN encoding of the words in the sentence to the pretrained word embeddings. You can also specify this completely from a JSON file, making experimenation with different representations extremely easy.\n",
    " "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This cell just makes sure the library paths are correct. \n",
    "# You need to run this cell before you run the rest of this\n",
    "# tutorial, but you can ignore the contents!\n",
    "import os\n",
    "import sys\n",
    "module_path = os.path.abspath(os.path.join('../..'))\n",
    "if module_path not in sys.path:\n",
    "    sys.path.append(module_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from allennlp.data.fields import TextField\n",
    "from allennlp.data import Instance\n",
    "from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenCharactersIndexer\n",
    "from allennlp.data.tokenizers import Token\n",
    "\n",
    "words = [\"All\", \"the\", \"cool\", \"kids\", \"use\", \"character\", \"embeddings\", \".\"]\n",
    "sentence1 = TextField([Token(x) for x in words],\n",
    "                      token_indexers={\"tokens\": SingleIdTokenIndexer(namespace=\"token_ids\"),\n",
    "                                      \"characters\": TokenCharactersIndexer(namespace=\"token_characters\")})\n",
    "words2 = [\"I\", \"prefer\", \"word2vec\", \"though\", \"...\"]\n",
    "sentence2 = TextField([Token(x) for x in words2],\n",
    "                      token_indexers={\"tokens\": SingleIdTokenIndexer(namespace=\"token_ids\"),\n",
    "                                      \"characters\": TokenCharactersIndexer(namespace=\"token_characters\")})\n",
    "instance1 = Instance({\"sentence\": sentence1})\n",
    "instance2 = Instance({\"sentence\": sentence2})\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we need to create a small vocabulary from our sentence - note that because we have used both a\n",
    "`SingleIdTokenIndexer` and a `TokenCharactersIndexer`, when we call `Vocabulary.from_instances`, the created `Vocabulary` will have two namespaces, which correspond to the namespaces of each token indexer in our `TextField`'s."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 2/2 [00:00<00:00, 7639.90it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "This is the token_ids vocabulary we created: \n",
      "\n",
      "{0: '@@PADDING@@', 1: '@@UNKNOWN@@', 2: 'All', 3: 'the', 4: 'cool', 5: 'kids', 6: 'use', 7: 'character', 8: 'embeddings', 9: '.', 10: 'I', 11: 'prefer', 12: 'word2vec', 13: 'though', 14: '...'}\n",
      "This is the character vocabulary we created: \n",
      "\n",
      "{0: '@@PADDING@@', 1: '@@UNKNOWN@@', 2: 'e', 3: 'r', 4: 'h', 5: 'c', 6: 'o', 7: 'd', 8: '.', 9: 'l', 10: 't', 11: 's', 12: 'i', 13: 'u', 14: 'a', 15: 'g', 16: 'A', 17: 'k', 18: 'm', 19: 'b', 20: 'n', 21: 'I', 22: 'p', 23: 'f', 24: 'w', 25: '2', 26: 'v'}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from allennlp.data import Vocabulary\n",
    "from allennlp.data.dataset import Batch\n",
    "\n",
    "# Make \n",
    "instances = [instance1, instance2]\n",
    "vocab = Vocabulary.from_instances(instances)\n",
    "\n",
    "print(\"This is the token_ids vocabulary we created: \\n\")\n",
    "print(vocab.get_index_to_token_vocabulary(\"token_ids\"))\n",
    "\n",
    "print(\"This is the character vocabulary we created: \\n\")\n",
    "print(vocab.get_index_to_token_vocabulary(\"token_characters\"))\n",
    "\n",
    "for instance in instances:\n",
    "    instance.index_fields(vocab)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from allennlp.modules.token_embedders import Embedding, TokenCharactersEncoder\n",
    "from allennlp.modules.seq2vec_encoders import CnnEncoder\n",
    "from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder\n",
    "\n",
    "# We're going to embed both the words and the characters, so we create\n",
    "# embeddings with respect to the vocabulary size of each of the relevant namespaces\n",
    "# in the vocabulary.\n",
    "word_embedding = Embedding(num_embeddings=vocab.get_vocab_size(\"token_ids\"), embedding_dim=10)\n",
    "char_embedding = Embedding(num_embeddings=vocab.get_vocab_size(\"token_characters\"), embedding_dim=5)\n",
    "character_cnn = CnnEncoder(embedding_dim=5, num_filters=2, output_dim=8)\n",
    "\n",
    "# This is going to embed an integer character tensor of shape: (batch_size, max_sentence_length, max_word_length) into\n",
    "# a 4D tensor with an additional embedding dimension, representing the vector for each character.\n",
    "# and then apply the character_cnn we defined above over the word dimension, resulting in a tensor\n",
    "# of shape: (batch_size, max_sentence_length, num_filters * ngram_filter_sizes). \n",
    "token_character_encoder = TokenCharactersEncoder(embedding=char_embedding, encoder=character_cnn)\n",
    "\n",
    "# Notice that these keys have the same keys as the TokenIndexers when we created our TextField.\n",
    "# This is how the text_field_embedder knows which function to apply to which array. \n",
    "# There should be a 1-1 mapping between TokenIndexers and TokenEmbedders in your model.\n",
    "text_field_embedder = BasicTextFieldEmbedder({\"tokens\": word_embedding, \"characters\": token_character_encoder})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we've actually created all the parts which we need to create concatenated word and character CNN embeddings, let's actually apply our `text_field_embedder` and see what happens. \n",
    "\n",
    "First we need to collect our instances into a `Batch` so we can convert them to tensors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Torch tensors for passing to a model: \n",
      "\n",
      " {'sentence': {'tokens': Variable containing:\n",
      "    2     3     4     5     6     7     8     9\n",
      "   10    11    12    13    14     0     0     0\n",
      "[torch.LongTensor of size 2x8]\n",
      ", 'characters': Variable containing:\n",
      "(0 ,.,.) = \n",
      "  16   9   9   0   0   0   0   0   0   0\n",
      "  10   4   2   0   0   0   0   0   0   0\n",
      "   5   6   6   9   0   0   0   0   0   0\n",
      "  17  12   7  11   0   0   0   0   0   0\n",
      "  13  11   2   0   0   0   0   0   0   0\n",
      "   5   4  14   3  14   5  10   2   3   0\n",
      "   2  18  19   2   7   7  12  20  15  11\n",
      "   8   0   0   0   0   0   0   0   0   0\n",
      "\n",
      "(1 ,.,.) = \n",
      "  21   0   0   0   0   0   0   0   0   0\n",
      "  22   3   2  23   2   3   0   0   0   0\n",
      "  24   6   3   7  25  26   2   5   0   0\n",
      "  10   4   6  13  15   4   0   0   0   0\n",
      "   8   8   8   0   0   0   0   0   0   0\n",
      "   0   0   0   0   0   0   0   0   0   0\n",
      "   0   0   0   0   0   0   0   0   0   0\n",
      "   0   0   0   0   0   0   0   0   0   0\n",
      "[torch.LongTensor of size 2x8x10]\n",
      "}}\n",
      "\n",
      "\n",
      "\n",
      "Post embedding with our TextFieldEmbedder: \n",
      "Batch Size:  2\n",
      "Sentence Length:  8\n",
      "Embedding Size:  18\n"
     ]
    }
   ],
   "source": [
    "from allennlp.data.dataset import Batch\n",
    "\n",
    "# Convert the indexed dataset into Pytorch Variables. \n",
    "batch = Batch(instances)\n",
    "tensors = batch.as_tensor_dict(batch.get_padding_lengths())\n",
    "print(\"Torch tensors for passing to a model: \\n\\n\", tensors)\n",
    "print(\"\\n\\n\")\n",
    "# tensors is a nested dictionary, first keyed by the\n",
    "# name we gave our instances (in most cases you'd have more\n",
    "# than one field in an instance) and then by the key of each\n",
    "# token indexer we passed to TextField.\n",
    "\n",
    "# This will contain two tensors: one from representing each\n",
    "# word as an index and one representing each _character_\n",
    "# in each word as an index. \n",
    "text_field_variables = tensors[\"sentence\"]\n",
    "\n",
    "# This will have shape: (batch_size, sentence_length, word_embedding_dim + character_cnn_output_dim)\n",
    "embedded_text = text_field_embedder(text_field_variables)\n",
    "\n",
    "dimensions = list(embedded_text.size())\n",
    "print(\"Post embedding with our TextFieldEmbedder: \")\n",
    "print(\"Batch Size: \", dimensions[0])\n",
    "print(\"Sentence Length: \", dimensions[1])\n",
    "print(\"Embedding Size: \", dimensions[2])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we've manually created the different TokenEmbedders which we wanted to use in our `TextFieldEmbedder`. However, all of these modules can be built using their `from_params` method, so you can have a `TextFieldEmbedder` in your model which is fixed (it encodes some sentence which is an input to your model), but vary the `TokenIndexers` and `TokenEmbedders` which it uses by changing a JSON file."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
